This tutorial aims to show the general workflow of how land cover classifications (or similar tasks) based on satellite data can be performed in R using machine learning algorithms.
The example prediction task is to perform a supervised land cover classification for Münster, Germany. The dataset includes selected spectral channels of a Sentinel-2 scene as well as derived channels (the NDVI and the standard deviation of the NDVI in a 5x5 pixel environment). As response (reference/ground truth) we use digitized polygons that were created on the basis of expert knowledge with support of high-resolution aerial images.
For this tutorial we need the raster package for processing the satellite data as well as the caret package as a wrapper for machine learning algorithms (here: Random Forest). The sf package is used for handling the training data, which are available as vector data (polygons), and mapview for interactive spatial visualization. CAST is used to account for spatial dependencies during model validation, and tmap for creating the final map.
rm(list=ls())
#major required packages:
library(raster)
library(caret)
library(mapview)
library(sf)
library(CAST)
library(tmap)
To start with, let’s load and explore the remote sensing raster data as well as the vector data that include the training sites.
sen_ms <- stack("data/Sen_Muenster.grd")
print(sen_ms)
## class : RasterStack
## dimensions : 664, 798, 529872, 9 (nrow, ncol, ncell, nlayers)
## resolution : 10, 10 (x, y)
## extent : 400620, 408600, 5753560, 5760200 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=32 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## names : B02, B03, B04, B08, B06, B07, B11, NDVI, NDVI_sd_5
## min values : 3.750000e+01, 2.115000e+02, 1.000000e+00, 1.000000e+00, 3.131562e+02, 2.798438e+02, 1.103750e+02, -9.969466e-01, 3.970687e-05
## max values : 1.458250e+04, 1.578750e+04, 1.876750e+04, 1.992900e+04, 1.097297e+04, 1.164534e+04, 1.375694e+04, 9.988630e-01, 1.573892e-02
The RasterStack contains a subset of the optical data from Sentinel-2 (see band information here: https://en.wikipedia.org/wiki/Sentinel-2) given in scaled reflectances (B02-B11). In addition, the NDVI was calculated, and spatial context is included as the standard deviation of the NDVI in a 5x5 pixel environment (NDVI_sd_5). Let's plot the RasterStack to get an idea of what the variables look like.
plot(sen_ms)
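As a side note, the two derived channels could in principle be reproduced from the spectral bands. The following is only a sketch, not necessarily the exact preprocessing used to create the dataset:

```r
# NDVI from the near-infrared (B08) and red (B04) bands:
ndvi <- (sen_ms$B08 - sen_ms$B04) / (sen_ms$B08 + sen_ms$B04)

# Standard deviation of the NDVI in a 5x5 pixel moving window:
ndvi_sd_5 <- focal(ndvi, w = matrix(1, 5, 5), fun = sd)
```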
The vector file is read as an sf object. It contains the training sites of 7 land cover classes: 33 polygons in total that were digitized in QGIS on the basis of the Sentinel data, with support of an aerial image and using expert knowledge. They serve here as ground truth for the land cover classification.
trainSites <- read_sf("data/trainingsites_muenster.gpkg")
print(trainSites)
## Simple feature collection with 33 features and 3 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 400682 ymin: 5753925 xmax: 408327.9 ymax: 5760113
## projected CRS: WGS 84 / UTM zone 32N
## # A tibble: 33 x 4
## ClassID Label PolygonID geom
## <dbl> <chr> <int> <POLYGON [m]>
## 1 11 Planted … 1 ((402789.2 5759326, 402850.9 5759266, 402901 575…
## 2 11 Planted … 2 ((403154.2 5759459, 403357.6 5759470, 403373 575…
## 3 11 Planted … 3 ((403124.5 5759788, 403314.6 5759806, 403397.5 5…
## 4 14 Inland w… 4 ((404809.3 5757129, 404865.6 5757163, 404865.6 5…
## 5 14 Inland w… 5 ((403729 5756102, 403797.5 5756077, 403770.6 575…
## 6 14 Inland w… 6 ((408139 5759018, 408201.5 5759011, 408189.7 575…
## 7 14 Inland w… 7 ((406992.5 5756470, 407110.9 5756441, 407184.3 5…
## 8 14 Inland w… 8 ((406334.9 5755042, 406351.1 5755068, 406382.8 5…
## 9 4 Grassland 9 ((403302.3 5755592, 403332.2 5755597, 403337.9 5…
## 10 4 Grassland 10 ((405602.6 5760078, 405610.9 5760104, 405628.3 5…
## # … with 23 more rows
Using mapview’s viewRGB function we can visualize the Sentinel channels as a true color composite in their geographical context and overlay them with the polygons. Click on a polygon to see which land cover class is assigned to it.
viewRGB(sen_ms, r = 3, g = 2, b = 1, map.types = "Esri.WorldImagery")+
mapview(trainSites)
In order to train a machine learning model that relates the spectral properties to the land cover classes, we first need a data frame that contains the predictor variables at the locations of the training sites as well as the corresponding class information. This data frame can be produced with the extract function: the result contains the predictor values for each pixel overlaid by the polygons. It then still needs to be merged with the land cover class information from the sf object.
extr <- extract(sen_ms, trainSites, df=TRUE)
extr <- merge(extr, trainSites, by.x="ID", by.y="PolygonID")
head(extr)
## ID B02 B03 B04 B08 B06 B07 B11 NDVI NDVI_sd_5
## 1 1 921.5 877 469 4936.0 3456.156 4704.219 1699.156 0.8264570 0.003605092
## 2 1 925.0 880 486 4918.0 3482.719 4717.906 1696.969 0.8201332 0.003988799
## 3 1 918.0 878 491 4907.0 3405.031 4582.531 1722.375 0.8180808 0.006624836
## 4 1 921.0 870 473 4924.0 3570.250 4891.750 1668.125 0.8247175 0.002481608
## 5 1 908.0 872 454 4899.5 3584.062 4919.188 1657.312 0.8303913 0.001217456
## 6 1 896.0 856 466 4895.5 3579.688 4897.812 1649.438 0.8261681 0.001640807
## ClassID Label geom
## 1 11 Planted field POLYGON ((402789.2 5759326,...
## 2 11 Planted field POLYGON ((402789.2 5759326,...
## 3 11 Planted field POLYGON ((402789.2 5759326,...
## 4 11 Planted field POLYGON ((402789.2 5759326,...
## 5 11 Planted field POLYGON ((402789.2 5759326,...
## 6 11 Planted field POLYGON ((402789.2 5759326,...
In order to speed things up for this tutorial, we will reduce the data: from each training polygon only 5% of the pixels, randomly drawn, will be used for model training.
set.seed(100)
trainids <- createDataPartition(extr$ID,list=FALSE,p=0.05)
trainDat <- extr[trainids,]
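Before training, it is worth verifying that all classes are still represented after subsetting. A quick check could look like this:

```r
# Number of remaining training pixels per land cover class:
table(trainDat$Label)
```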
For model training we need to define the predictor and response variables. As predictors we can use basically all information from the raster stack as we might assume they could all be meaningful for the differentiation between the land cover classes. As response variable we use the “Label” column of the data frame.
predictors <- names(sen_ms)
response <- "Label"
We then train a Random Forest model to learn how the classes can be distinguished based on the predictors (note: other algorithms would work as well; see https://topepo.github.io/caret/available-models.html for a list of algorithms available in caret). Caret's train function does this job. Before starting model training we can specify some control settings using trainControl. For hyperparameter tuning (mtry) as well as for error assessment we use a spatial 3-fold cross-validation: the training data are split into 3 folds, but data from the same polygon are always kept together so that they never occur in both training and testing. We also make sure that each fold contains data from each land cover class. CAST's CreateSpacetimeFolds does this job when we specify the polygon ID and the class label.
indices <- CreateSpacetimeFolds(trainDat,spacevar = "ID",k=3,class="Label")
ctrl <- trainControl(method="cv",
index = indices$index,
savePredictions = TRUE)
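As an optional sanity check, we might confirm that the folds really separate the polygons, using the index and indexOut components returned by CreateSpacetimeFolds:

```r
# No polygon ID should occur in both the training part and the
# held-out part of the same fold:
for (i in 1:3) {
  train_ids <- unique(trainDat$ID[indices$index[[i]]])
  test_ids  <- unique(trainDat$ID[indices$indexOut[[i]]])
  print(length(intersect(train_ids, test_ids)))  # should be 0
}
```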
Model training is then performed using caret's train function. We specify "rf" as method, indicating that a Random Forest is applied. To speed things up we reduce the number of trees (ntree) to 75; note that usually a larger number (>250) is appropriate. Both Accuracy and Kappa are reported during cross-validation; by default, Accuracy is used to select the optimal mtry.
# train the model
set.seed(100)
model <- train(trainDat[,predictors],
trainDat[,response],
method="rf",
trControl=ctrl,
importance=TRUE,
ntree=75)
print(model)
## Random Forest
##
## 571 samples
## 9 predictor
## 7 classes: 'Fallow field', 'Grassland', 'Industrial', 'Inland water', 'Mixed forest', 'Planted field', 'Urban'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 389, 340, 413
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9401007 0.9248781
## 5 0.9594245 0.9487587
## 9 0.9527069 0.9403948
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
plot(varImp(model))
When we print the model (see above) we get a summary of the prediction performance as the average Accuracy and Kappa of the three spatial folds (the print output reports "10 fold" because caret shows its default when the folds are supplied via index; only three folds were actually used, as the three sample sizes confirm). By pooling all cross-validated predictions we can obtain the "global" model performance.
# get all cross-validated predictions:
cvPredictions <- model$pred[model$pred$mtry==model$bestTune$mtry,]
# calculate cross table:
table(cvPredictions$pred,cvPredictions$obs)
##
## Fallow field Grassland Industrial Inland water Mixed forest
## Fallow field 66 0 0 0 0
## Grassland 0 20 0 0 0
## Industrial 2 0 32 0 0
## Inland water 0 0 0 76 0
## Mixed forest 0 0 0 0 126
## Planted field 0 1 0 0 0
## Urban 3 0 4 0 0
##
## Planted field Urban
## Fallow field 0 4
## Grassland 1 0
## Industrial 0 4
## Inland water 0 1
## Mixed forest 1 2
## Planted field 70 0
## Urban 0 158
We see that the performance is very high and that only minor misclassifications occur.
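For a more detailed assessment, caret's confusionMatrix function could be applied to the same pooled cross-validated predictions; it returns the cross table together with overall Accuracy, Kappa and per-class statistics:

```r
# Detailed accuracy statistics from the cross-validated predictions:
confusionMatrix(cvPredictions$pred, cvPredictions$obs)
```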
To perform the classification we apply the trained model to each pixel of the raster stack using the predict function. We can then create a map of the predicted land cover with meaningful colors using the tmap package.
prediction <- predict(sen_ms,model)
cols <- c("sandybrown", "green", "darkred", "blue", "forestgreen", "lightgreen", "red")
tm_shape(deratify(prediction)) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
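Finally, to reuse the classification outside of R, the prediction could be written to disk, e.g. as a GeoTIFF (the file name here is just an example):

```r
# Export the classified raster for use in GIS software:
writeRaster(prediction, "data/prediction_muenster.tif", overwrite = TRUE)
```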